H2O: A hybrid and hierarchical outlier detection method for large scale data protection

نویسندگان

  • Quan Zhang
  • Mu Qiao
  • Ramani R. Routray
  • Weisong Shi
چکیده

Data protection is the process of backing up data in case of a data loss event. It is one of the most critical routine activities for every organization. Detecting abnormal backup jobs is important to prevent data protection failures and ensure the service quality. Given the large scale backup endpoints and the variety of backup jobs, from a backup-asa-service provider viewpoint, we need a scalable and flexible outlier detection method that can model a huge number of objects and well capture their diverse patterns. In this paper, we introduce H2O, a novel hybrid and hierarchical method to detect outliers from millions of backup jobs for large scale data protection. Our method automatically selects an ensemble of outlier detection models for each multivariate time series composed by the backup metrics collected for each backup endpoint by learning their exhibited characteristics. Interactions among multiple variables are considered to better detect true outliers and reduce false positives. In particular, a new seasonal-trend decomposition based outlier detection method is developed, considering the interactions among variables in the form of common trends, which is robust to the presence of outliers in the training data. The model selection process is hierarchical, following a global to local fashion. The final outlier is determined through an ensemble learning by multiple models. Built on top of Apache Spark, H2O has been deployed to detect outliers in a large and complex data protection environment with more than 600,000 backup endpoints and 3 million daily backup jobs. To the best of our knowledge, this is the first work that selects and constructs large scale outlier detection models for multivariate time series on big data platforms. Keywords-anomaly detection, multivariate time series, hybrid, hierarchical, big data, model selection

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Detecting Suspicious Card Transactions in unlabeled data of bank Using Outlier Detection Techniqes

With the advancement of technology, the use of ATM and credit cards are increased. Cyber fraud and theft are the kinds of threat which result in using these Technologies. It is therefore inevitable to use fraud detection algorithms to prevent fraudulent use of bank cards. Credit card fraud can be thought of as a form of identity theft that consists of an unauthorized access to another person's ...

متن کامل

Application of Recursive Least Squares to Efficient Blunder Detection in Linear Models

In many geodetic applications a large number of observations are being measured to estimate the unknown parameters. The unbiasedness property of the estimated parameters is only ensured if there is no bias (e.g. systematic effect) or falsifying observations, which are also known as outliers. One of the most important steps towards obtaining a coherent analysis for the parameter estimation is th...

متن کامل

Identification of outliers types in multivariate time series using genetic algorithm

Multivariate time series data, often, modeled using vector autoregressive moving average (VARMA) model. But presence of outliers can violates the stationary assumption and may lead to wrong modeling, biased estimation of parameters and inaccurate prediction. Thus, detection of these points and how to deal properly with them, especially in relation to modeling and parameter estimation of VARMA m...

متن کامل

A statistical test for outlier identification in data envelopment analysis

In the use of peer group data to assess individual, typical or best practice performance, the effective detection of outliers is critical for achieving useful results. In these ‘‘deterministic’’ frontier models, statistical theory is now mostly available. This paper deals with the statistical pared sample method and its capability of detecting outliers in data envelopment analysis. In the prese...

متن کامل

Outlier Detection Using Extreme Learning Machines Based on Quantum Fuzzy C-Means

One of the most important concerns of a data miner is always to have accurate and error-free data. Data that does not contain human errors and whose records are full and contain correct data. In this paper, a new learning model based on an extreme learning machine neural network is proposed for outlier detection. The function of neural networks depends on various parameters such as the structur...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2016